The problem of multi-agent learning and adaptation has attracted a great deal of attention in recent years. It
has been suggested that the dynamics of multi-agent learning can be studied using replicator equations from
population biology. Most existing studies so far have been limited to discrete strategy spaces with a small number
of available actions. In many cases, however, the choices available to agents are better characterized by
continuous spectra. This paper suggests a generalization of the replicator framework that allows one to study the
adaptive dynamics of Q-learning agents with continuous strategy spaces. Instead of probability vectors, agents'
strategies are now characterized by probability measures over continuous variables. As a result, the ordinary
differential equations for the discrete case are replaced by a system of coupled integro-differential replicator
equations that describe the mutual evolution of individual agent strategies. We derive a set of functional
equations describing the steady state of the replicator dynamics, examine their solutions for several two-player
games, and confirm our analytical results using simulations.




I. INTRODUCTION 

The notion of autonomous agents that learn by interacting with the environment, and possibly with other agents, is a central
concept in modern distributed AI [18]. Of particular interest are systems where multiple agents learn concurrently and
independently by interacting with each other. This multi-agent learning problem has attracted a great deal of attention due to
a number of important applications. Among existing approaches, multi-agent reinforcement learning (MARL) algorithms have become
increasingly popular due to their generality. Although MARL does not have the same convergence guarantees as the
single-agent case, it has been shown to work well in practice.

From the analysis standpoint, MARL represents a complex dynamical system, where the learning trajectories of individual
agents are coupled with each other via a collective reward mechanism. Thus, it is desirable to know the possible
long-term behaviors of those trajectories. Specifically, one is usually interested in whether, for a particular game structure,
those trajectories converge to desirable steady states (called fixed points), or oscillate indefinitely between many (possibly
infinitely many) meta-stable states. While answering this question has proven to be very difficult in the most general settings,
there has been some limited progress for specific scenarios. In particular, it has been established that for simple, stateless
Q-learning with a finite number of actions, the learning dynamics can be examined using the so-called replicator equations from
population biology [1]. Namely, if one associates a particular biological trait with each pure strategy, then the adaptive
learning of (possibly mixed) strategies in multi-agent settings is analogous to the competitive dynamics of a mixed population,
where the species evolve according to their relative fitness in the population. This framework has been used successfully to
study various interesting features of the adaptive dynamics of learning agents [1, 12, 15, 16, 19].

Most existing studies so far have focused on discrete action spaces, which has limited the full analysis of the learning dynamics
to games with very few actions. On the other hand, in many practical scenarios, strategic interactions between agents are better
characterized by continuous spectra of possible choices. For instance, modeling an agent's bid in an auction with a continuous
rather than discrete variable is more natural. In such situations, agents' strategies are represented as probability density
functions defined over a continuous set of strategies. Of course, in reality all decisions are made over a discretized subset.
However, the rationale for using the continuous approximation is that it makes the dynamics more amenable to mathematical analysis.

In this paper we consider simple Q-learning agents that play repeated continuous-strategy games. The agents use a Boltzmann
action-selection mechanism that controls the exploration/exploitation tradeoff by a single, temperature-like parameter. The
reward functions for the agents are assumed to be functions of continuous variables instead of tensors, and the agent strategies
are represented as probability distributions over those variables. In contrast to finite strategy spaces, where the learning
dynamics is captured by a set of coupled ordinary differential equations, the replicator dynamics for continuous-strategy games
is described by functional-differential equations for each agent, with coupling across different agents/equations.

The long-term behavior of those equations defines the steady-state, or equilibrium, profiles of the agent strategies. It is shown
that, in general, the steady state strategy profiles of the replicator dynamics do not correspond to the Nash equilibria of the game.
This discrepancy can be attributed to the limited rationality of the agents due to the non-zero temperature (i.e., exploration).
Furthermore, it is shown that the correspondence with the Nash equilibria is often recovered if one gradually decreases the
temperature. However, for certain games, the replicator dynamics might converge to strategy profiles that have non-zero entropy
even in the limit of perfectly rational agents, and for which there are no corresponding Nash equilibria in the zero-temperature
limit.

The rest of this paper is organized as follows: In the next section we provide a brief overview of relevant literature. In
Section III we introduce our model, and derive the replicator equations for continuous strategy spaces together with a set of
coupled non-linear functional equations that describe the steady state strategy profile. In Section IV we illustrate the framework
on several examples of two-agent games, and provide some detailed results for general bi-linear and quadratic payoffs. Finally, we
conclude the paper with a discussion of our results and possible future directions in Section V.

II. BACKGROUND AND RELATED WORK

Reinforcement learning (RL) is a powerful framework in which an agent learns to behave optimally through trial-and-error
exploration of the environment. At each step of interaction with the environment, the agent chooses an action based on
the current state of the environment, and receives a scalar reinforcement signal, or a reward, for that action. The agent's overall
goal is to learn to act in a way that will increase the long-term cumulative reward. Although RL was originally developed for
single-agent learning in a stationary environment, it has also been generalized to the multi-agent scenario. In the multi-agent
setting, the environment is highly dynamic because of the presence of other learning agents, and the usual conditions for
convergence to an optimal policy do not necessarily hold. Nevertheless, various generalizations of single-agent learning
algorithms have been successfully applied to multi-agent settings.

Despite some empirical success, theoretical advances in multi-agent reinforcement learning have been rather scarce. Recent
work has suggested utilizing the link between MARL and replicator dynamics from population biology. Those equations have
demonstrated very rich and complex behavior, such as sensitivity to initial conditions, and even Hamiltonian chaos [15, 16]. A
similar approach was used in [19], where the Cross Learning model of Ref. [1] was extended to Q-learning, and where the link
between multi-agent replicator dynamics and Evolutionary Game Theory was reiterated. The replicator dynamics framework
was also used in [12] to study the advantages of lenient learners in repeated coordination games, where some convergence
guarantees for certain games were obtained.

In addition to discrete strategy spaces, recent work has addressed games with continuous strategy spaces. For instance,
continuous strategy versions of the prisoner's dilemma have been considered. Replicator equations in continuous
strategy spaces have been studied in the context of evolutionary and dynamical stability. The corresponding
replicator equation is similar to the one studied in the present paper, except that there is no entropy (mutation) term. This is
an important distinction, since without the entropy term the allowed steady state solutions are confined to a set of Dirac
$\delta$-measures at distinct points. In other words, even if one starts with a continuous population, the replicator dynamics
will converge to monomorphic fixed points.

The work that is closest to the one presented here is Ref. [14], which studied a continuous strategy replicator system with and
without mutation. They established that for specific games, mutations resulted in non-trivial modifications to the equilibrium
structure. The difference between their work and the model presented here is the origin of the mutation term: while in Ref. [14]
the mutation was added as a generic diffusive term, in our model it has a very intuitive entropic meaning that results from
the Boltzmann action-selection mechanism. Specifically, the replicator equation in our case results from minimizing a certain
functional that is reminiscent of the so-called free energy from statistical physics. This fact provides a very intuitive picture
of the steady state structure.

III. MODEL 

We consider a system of $N$ agents that play repeated games with each other. In the present paper, we assume a stateless
model, so that the reward for each agent depends only on the collective action of the agents. Let $x_i$ denote the action taken by
the $i$-th agent. Also, let $x_{-i}$ denote the collective action profile of all the agents except $i$. Without loss of generality,
we assume that the actions are restricted to the unit interval, $x_i \in [0, 1]$, $i = 1, 2, \ldots, N$.

The game proceeds as follows: At each time step, each agent chooses an action, receives a reward that depends on the
collective action of all the agents, and updates his strategy accordingly. Each agent has a Q-function that encodes the relative
utility of actions. Those Q-functions are updated each time an agent selects an action, according to the following reinforcement
rule [22]:

$$Q_i(x_i, t + \delta t) = Q_i(x_i, t) + \alpha \left[ f_i(x_i; x_{-i}) - Q_i(x_i, t) \right] \tag{1}$$

where $f_i(x_i; x_{-i})$ is the reward of agent $i$ when he takes the action $x_i$ and the rest of the agents take the collective
action $x_{-i}$.

Next, we have to specify how agents choose actions based on Q-functions. There are several action-selection mechanisms.
Here we focus on the so-called Boltzmann exploration, where the probability of selecting a particular action $x_i$ is given as

$$p_i(x_i, t) = C(t) \, e^{\beta Q_i(x_i, t)} \tag{2}$$

where $C(t)$ is simply a (time-dependent) normalization constant:

$$C(t) = \left[ \int_0^1 dx_i \, e^{\beta Q_i(x_i, t)} \right]^{-1} \tag{3}$$

and where $\beta = 1/T$ is a parameter that controls the exploration/exploitation tradeoff, and has the meaning of an inverse
temperature owing to the analogy with statistical-mechanical systems.
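The update rule and the Boltzmann selection rule can be illustrated on a discretized action set. The following is only a minimal sketch under stated assumptions: the five-point grid, the single-agent reward peaked at $x = 0.5$, and the helper names `boltzmann_probs` and `q_update` are hypothetical and are not part of the model above.

```python
import math
import random

def boltzmann_probs(q_values, beta):
    """Discretized Boltzmann rule: p_k is proportional to exp(beta * Q_k)."""
    q_max = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def q_update(q_values, k, reward, alpha):
    """Stateless Q-learning update for the action at grid index k."""
    updated = list(q_values)
    updated[k] += alpha * (reward - updated[k])
    return updated

# Hypothetical setup: 5 grid points on [0, 1] and a reward peaked at x = 0.5.
grid = [k / 4 for k in range(5)]
q = [0.0] * len(grid)
rng = random.Random(0)
for _ in range(2000):
    probs = boltzmann_probs(q, beta=5.0)
    k = rng.choices(range(len(grid)), weights=probs)[0]
    reward = -(grid[k] - 0.5) ** 2
    q = q_update(q, k, reward, alpha=0.1)
final_probs = boltzmann_probs(q, beta=5.0)
```

In this toy run the learned Q-values approach the rewards, so the selection probabilities concentrate around the grid point with the highest reward, with a spread controlled by the assumed value of `beta`.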

We now assume that the agents interact many times between two consecutive updates of their strategies. In this case, the
reward of the $i$-th agent in Equation (1) should be understood in terms of the average reward, where the averaging is done over
the strategies of the other agents in the system. Specifically, taking the limit $\delta t \to 0$, we can rewrite Equation (1) as
follows:

$$\frac{\partial Q_i(x_i, t)}{\partial t} = \alpha \left[ r_i(x_i) - Q_i(x_i, t) \right] \tag{4}$$

where $r_i(x_i)$ is the average reward "felt" by the $i$-th agent:

$$r_i(x_i) = \int \cdots \int \prod_{j \neq i} dx_j \, p_j(x_j, t) \, f_i(x_i; x_{-i}) \tag{5}$$



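We want to eliminate $Q$ from the dynamics, so that the learning trajectories are expressed solely through the agent strategies.
To achieve this, let us take the derivative of Equation (2) with respect to time:

$$\frac{\partial p_i(x_i, t)}{\partial t} = \frac{dC}{dt} \, e^{\beta Q_i(x_i, t)} + \beta C(t) \, e^{\beta Q_i(x_i, t)} \, \frac{\partial Q_i(x_i, t)}{\partial t} \tag{6}$$

Note that

$$\frac{dC}{dt} = - \frac{\int dx \, \beta e^{\beta Q_i(x, t)} \, \partial Q_i(x, t) / \partial t}{\left[ \int dx \, e^{\beta Q_i(x, t)} \right]^2} = - \beta C(t) \int dx \, p_i(x, t) \, \frac{\partial Q_i(x, t)}{\partial t} \tag{7}$$

Combining Equations (2), (4), (6), and (7), and using Equation (2) to write $Q_i(x, t) = T [\ln p_i(x, t) - \ln C(t)]$ (the
$\ln C(t)$ terms cancel), we arrive at the following:

$$\frac{1}{p_i(x, t)} \frac{\partial p_i(x, t)}{\partial (\alpha t)} = \frac{1}{T} \left[ r_i(x) - \int dx' \, r_i(x') \, p_i(x', t) \right] - \left[ \ln p_i(x, t) - \int dx' \, p_i(x', t) \ln p_i(x', t) \right] \tag{8}$$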



Equation (8) has a very simple interpretation. Indeed, the first term suggests that the probability of playing a particular
pure strategy increases at a rate proportional to the overall efficiency of that strategy. This is reminiscent of the fitness-based
selection mechanism in population biology. The second term, on the other hand, does not have a direct analogue in population
biology, and describes the agents' tendency to randomize over the strategies.
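The interplay of the selection and entropy terms can be checked numerically. The sketch below is illustrative only: it assumes a single agent with a fixed reward $r(x) = -(x - 0.3)^2$, a midpoint grid, an explicit Euler step in units of $1/\alpha$, and a renormalization after each step to counter round-off drift. None of these choices come from the text. Under these assumptions the density relaxes to the normalized $e^{r(x)/T}$.

```python
import math

def replicator_step(p, r, T, h, dt):
    """One explicit Euler step of the discretized replicator equation.

    p: density values on a uniform grid, r: average reward on the same grid,
    T: temperature, h: grid spacing, dt: time step in units of 1/alpha.
    """
    mean_r = sum(h * pk * rk for pk, rk in zip(p, r))
    mean_log = sum(h * pk * math.log(pk) for pk in p)
    new = [
        pk + dt * pk * ((rk - mean_r) / T - (math.log(pk) - mean_log))
        for pk, rk in zip(p, r)
    ]
    norm = sum(h * v for v in new)  # renormalize to counter round-off drift
    return [v / norm for v in new]

# Hypothetical single-agent example with a fixed reward r(x) = -(x - 0.3)^2.
M, T = 50, 0.1
h = 1.0 / M
xs = [(k + 0.5) * h for k in range(M)]
r = [-(x - 0.3) ** 2 for x in xs]
p = [1.0] * M  # start from the uniform density
for _ in range(600):
    p = replicator_step(p, r, T, h, dt=0.05)

# Gibbs-like density exp(r / T), normalized on the same grid, for comparison.
w = [math.exp(rk / T) for rk in r]
Z = sum(h * wk for wk in w)
gibbs = [wk / Z for wk in w]
max_err = max(abs(u - v) for u, v in zip(p, gibbs))
```

The selection term alone would push all weight toward the maximum of `r`; the logarithmic term halts this at a finite width set by `T`.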



A. Steady State Solution 

We are interested in the asymptotic behavior of Equation (8) after a sufficiently long time, $P_i(x_i) = p_i(x_i, t \to \infty)$.
The steady-state equation is obtained by setting the time derivative to zero, which yields a set of equations (one for each agent)

$$P_i(x) \left[ R_i(x) - T \ln P_i(x) + \mathrm{const} \right] = 0, \tag{9}$$

where $R_i(x)$ depends on the collective strategy profile of all the agents as described by Equation (5), with $p_i(x)$ replaced by
$P_i(x)$. Let us focus on solutions $P_i(x)$ that have continuous support on the interval $[0, 1]$, i.e., $P_i(x)$ is strictly
positive for all $x \in [0, 1]$, except maybe at a finite number of points. Since Equation (9) should be satisfied for all values
$x \in [0, 1]$, the expression in the brackets must vanish for all $x$, which yields, for $i = 1, 2, \ldots, N$:

$$P_i(x) = \frac{e^{\beta R_i(x)}}{\int_0^1 dx' \, e^{\beta R_i(x')}} \tag{10}$$
Equations (10) are coupled, highly non-linear integral equations, whose solutions determine the steady state strategy profiles
of the agents. The coupling enters non-trivially through the average rewards received by the agents. Note that in the
single-agent learning scenario, $R_i(x)$ can be viewed as the reward received by the agent for playing the strategy $x$. In this
case, Equation (10) suggests that the steady state strategy profile is given by the Gibbs-like measure for a system with energy
$-R_i(x)$ and temperature $1/\beta$, and can be derived from the following considerations. Let us define the following functional:

$$\Phi[p(x)] = - \int dx \, R(x) \, p(x) + T \int dx \, p(x) \ln p(x) \tag{11}$$

$\Phi$ is the so-called free energy from statistical physics: the first term is the average energy, and the second term is $-T$
times the entropy. It is easy to check that Equation (10) minimizes $\Phi$, subject to the condition that $p(x)$ is normalized.
Indeed, introducing a Lagrange multiplier $\lambda$ to account for the normalization constraint (and absorbing an additive
constant into it), and taking the functional derivative with respect to $p$, we obtain

$$\delta \Phi[p(x)] = - \int dx \left[ R(x) - T \ln p(x) - \lambda \right] \delta p(x) = 0 \tag{12}$$

which again yields Equation (10). This suggests that for any non-zero temperature, the replicator dynamics will lead to a steady
state that has non-zero entropy. In the terminology of population biology, non-zero temperature guarantees population diversity,
and monomorphic solutions are not allowed. The actual degree of diversity is governed by the temperature. Below we examine this
question in more detail.
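The minimization property can be checked numerically on a grid. The sketch below is only an illustration: it assumes an arbitrary reward $R(x) = \sin(3x)$, a midpoint grid, and takes the free energy to be the average energy (with energy $-R$) minus $T$ times the entropy. The helper names are hypothetical. It compares the Gibbs-like density against a uniform density and against an overly sharp density.

```python
import math

def free_energy(p, R, T, h):
    """Discretized free energy: -sum(h R p) + T sum(h p ln p)."""
    energy = -sum(h * Rk * pk for Rk, pk in zip(R, p))
    neg_entropy = sum(h * pk * math.log(pk) for pk in p if pk > 0.0)
    return energy + T * neg_entropy

def normalized(weights, h):
    """Scale grid weights so that the density integrates to one."""
    total = sum(weights) * h
    return [w / total for w in weights]

M, T = 100, 0.2
h = 1.0 / M
xs = [(k + 0.5) * h for k in range(M)]
R = [math.sin(3.0 * x) for x in xs]  # arbitrary illustrative reward

gibbs = normalized([math.exp(Rk / T) for Rk in R], h)
uniform = [1.0] * M
too_sharp = normalized([math.exp(2.0 * Rk / T) for Rk in R], h)

phi_gibbs = free_energy(gibbs, R, T, h)
phi_uniform = free_energy(uniform, R, T, h)
phi_sharp = free_energy(too_sharp, R, T, h)
log_Z = math.log(sum(math.exp(Rk / T) for Rk in R) * h)
```

For the Gibbs-like density the free energy equals $-T \ln Z$, where $Z$ is the normalization integral, and both alternative densities give a larger value.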

IV. EXAMPLES 

Owing to the highly non-linear nature of the steady state equations, they cannot be solved analytically in the most general
case of arbitrary payoffs. However, one can still establish important results for certain classes of games. In the remainder of
the paper we focus on several such examples. We will limit our consideration to two-player games. Let $x$ and $y$ denote the
actions of each agent. The system (10) then has the following form:

$$P_1(x) = A_1 e^{\beta R_1(x)} \equiv A_1 e^{\beta \int dy \, f_1(x, y) P_2(y)} \tag{13}$$
$$P_2(y) = A_2 e^{\beta R_2(y)} \equiv A_2 e^{\beta \int dx \, f_2(x, y) P_1(x)} \tag{14}$$

where $A_1$, $A_2$ are normalization constants.
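One half of this coupled system can be evaluated directly on a grid: given the other player's density, average the payoff over it and exponentiate. The sketch below is illustrative only. The function name `boltzmann_response`, the midpoint grid, and the example payoff $f_1(x, y) = xy - 0.2x$ are assumptions introduced here.

```python
import math

def boltzmann_response(payoff, other_density, beta, m=200):
    """Density proportional to exp(beta * R(x)) on a midpoint grid of [0, 1].

    R(x) is the payoff averaged over the other player's density, which is
    given as a list of m values on the same grid.
    """
    h = 1.0 / m
    grid = [(k + 0.5) * h for k in range(m)]
    rewards = [
        sum(payoff(x, y) * q for y, q in zip(grid, other_density)) * h
        for x in grid
    ]
    r_max = max(rewards)  # shift for numerical stability
    weights = [math.exp(beta * (r - r_max)) for r in rewards]
    norm = sum(weights) * h
    return grid, [w / norm for w in weights]

# Hypothetical bi-linear payoff with a = 1, b = -0.2 and a uniform opponent.
def f1(x, y):
    return 1.0 * x * y - 0.2 * x

grid, p1 = boltzmann_response(f1, [1.0] * 200, beta=4.0)
```

With a uniform opponent the averaged reward in this example is $0.3x$, so the returned density is exponential in $x$ with exponent $1.2x$. A full steady state would require both densities to be consistent with each other.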

A. Bi-Linear Payoff 

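We first consider games with a symmetric bi-linear payoff structure, which, in the general case, can be written as follows:

$$f_1(x, y) = a_1 x y + b_1 x, \qquad f_2(x, y) = a_2 x y + b_2 y \tag{15}$$

It is simple to see that the average reward of each agent is a linear function of his own action:

$$R_1(x) = x (a_1 \bar{y} + b_1) \tag{16}$$
$$R_2(y) = y (a_2 \bar{x} + b_2) \tag{17}$$

where $\bar{x}$, $\bar{y}$ are the means of the respective distributions. This suggests the following form for the steady state
solutions:

$$P_1(x) \propto e^{\gamma_1 x}, \qquad P_2(y) \propto e^{\gamma_2 y} \tag{18}$$

where $\gamma_1$, $\gamma_2$ can be found from a self-consistency condition. For the symmetric case, $a_1 = a_2 = a$ and
$b_1 = b_2 = b$, this condition yields

$$\frac{\gamma_1}{\beta} = g(\gamma_2), \qquad \frac{\gamma_2}{\beta} = g(\gamma_1) \tag{19}$$

where the function $g(\gamma)$ is defined as follows:

$$g(\gamma) = b + a \left[ \frac{1}{1 - e^{-\gamma}} - \frac{1}{\gamma} \right] \tag{20}$$

Here the term in the brackets is the mean of the density proportional to $e^{\gamma x}$ on $[0, 1]$.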
As we show below, the nature of the solution strongly depends on the choice of the parameters. We now examine this
dependence in more detail. First, let us focus on the symmetric solutions, $\gamma_1 = \gamma_2 = \gamma$, in which case
Equation (19) becomes

$$\frac{\gamma}{\beta} = g(\gamma) \tag{21}$$



A graphical illustration of Equation (21) is presented in Figure 1(a), where we plot $g(\gamma)$ together with the line
$\gamma/\beta$ for different values of $b$. Let us first consider the case $a > 0$ (Figure 1(a)). In this case, $g(\gamma)$ is a
monotonically increasing function of its argument, which tends to $b$ as $\gamma \to -\infty$, and to $b + a$ as
$\gamma \to \infty$. A simple inspection shows that for $b > 0$, the equation has a unique solution that increases almost linearly
with the inverse temperature $\beta$. Similarly, for $b < -a$, the equation again has a unique solution, which decreases linearly
with increasing $\beta$. Thus, the steady state strategies are peaked around $x, y = 0$ ($x, y = 1$) for $b < -a$ ($b > 0$), with
the width of the distribution shrinking roughly as $1/\beta$. The situation is different for the intermediate values of $b$.
Indeed, it can be seen from Figure 1(a) that for $-a < b < 0$, there is a critical value $\beta_c$ so that for any
$\beta > \beta_c$, there are three distinct solutions. A simple analysis shows that the solution in the middle is unstable, while
the two other solutions, one always positive and the other negative (encircled in Figure 1(a)), are stable. Furthermore, it is
easy to check that the asymptotic behavior of the solutions for large values of $\beta$ is $\gamma \approx \pm \beta |b|$. Thus,
in the zero-temperature limit $\beta \to \infty$, the steady state strategy profiles are $\delta$-functions at either 0 or 1,
depending on the initial conditions. This prediction was confirmed in our simulations in Figure 1(b), where we show the
steady-state strategy profile for a game with $a = 1$ and $b = 0$. The open symbols are the results of simulations, which agree
very well with the analytical prediction. For this particular case, the steady-state strategy is peaked around the Nash
equilibrium point $x = y = 1$. Following the discussion above, it is clear that with increasing $\beta$, $p(x)$ will tend to a
point mass at $x = 1$.

FIG. 1: (a) Graphical representation of Equation (21) for $a = 1$ and different values of $b$, as shown. The straight line has
slope $1/\beta$; (b) Steady state strategy profiles for $a = 1$, $b = 0$.
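The appearance of three symmetric solutions above a critical inverse temperature can be checked numerically. The following sketch is illustrative only: it assumes the self-consistency condition $\gamma/\beta = g(\gamma)$ with $g(\gamma) = b + a\,m(\gamma)$, where $m(\gamma) = 1/(1 - e^{-\gamma}) - 1/\gamma$ is the mean of the density proportional to $e^{\gamma x}$ on $[0, 1]$. The parameter values $a = 1$, $b = -0.5$, the scan window, and the function names are assumptions made for illustration.

```python
import math

def mean_exp(gamma):
    """Mean of the density proportional to exp(gamma * x) on [0, 1]."""
    if abs(gamma) < 1e-4:
        return 0.5 + gamma / 12.0  # leading terms of the series near zero
    if gamma < 0.0:
        return 1.0 - mean_exp(-gamma)  # reflection x -> 1 - x
    return 1.0 / (-math.expm1(-gamma)) - 1.0 / gamma

def g(gamma, a, b):
    return b + a * mean_exp(gamma)

def symmetric_solutions(a, b, beta, lo=-40.013, hi=40.0, step=0.05):
    """Roots of g(gamma) - gamma / beta found by a sign scan plus bisection.

    The scan window is assumed to contain all roots, which holds here
    because |gamma| cannot exceed beta * max(|b|, |a + b|).
    """
    def h(gamma):
        return g(gamma, a, b) - gamma / beta
    roots = []
    left = lo
    while left < hi:
        right = left + step
        if h(left) * h(right) < 0.0:
            u, v = left, right
            for _ in range(60):
                mid = 0.5 * (u + v)
                if h(u) * h(mid) <= 0.0:
                    v = mid
                else:
                    u = mid
            roots.append(0.5 * (u + v))
        left = right
    return roots

low_beta = symmetric_solutions(a=1.0, b=-0.5, beta=5.0)
high_beta = symmetric_solutions(a=1.0, b=-0.5, beta=30.0)
```

For these assumed parameters the scan returns a single root at the lower value of `beta` and three roots at the higher one, the outer two being mirror images of each other.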

Now consider the case $a < 0$. The corresponding $g(\gamma)$ is shown in Figure 2(a). Since $g(\gamma)$ is a strictly decreasing
function, Equation (21) has only one solution for any values of $b$ and $\beta$. However, a simple analysis reveals that, depending
on the value of $b$, the asymptotic behavior of the solution for large $\beta$ is different. Indeed, for $|b| > |a|$, the solution
behaves asymptotically as $\gamma \approx \pm |b| \beta$. Thus, the zero-temperature steady state strategy profile again
corresponds to a point mass at $x = 0$ or $x = 1$, depending on the initial conditions. For the intermediate values $|b| < -a$, on
the other hand, the solution is almost independent of $\beta$, as can be seen from the graph. In this case, the steady state
strategies have continuous support for any $\beta$, and do not converge to the monomorphic state that corresponds to the Nash
equilibrium at $x^* = y^* = 0$. This behavior is depicted in Figure 2(b), where we plot the steady state strategy profile for
different values of $\beta$. Note that increasing $\beta$ makes the distribution more peaked initially, but it saturates for
sufficiently large $\beta$. This is demonstrated in the inset of Figure 2(b), which shows the dependence of $\gamma$ on the inverse
temperature. Thus, even in the limit $\beta \to \infty$, the solution will always have a finite (non-zero) entropy, and will not
converge to a monomorphic state.
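The saturation of $\gamma$ with growing $\beta$ can be illustrated numerically. The sketch below is not the authors' procedure: it assumes the symmetric condition $\gamma/\beta = b + a\,m(\gamma)$, with $m(\gamma)$ the mean of the density proportional to $e^{\gamma x}$ on $[0, 1]$, and uses the hypothetical parameter values $a = -1$, $b = 0.3$ (so that $0 < b < -a$), solved by bisection.

```python
import math

def mean_exp(gamma):
    """Mean of the density proportional to exp(gamma * x) on [0, 1]."""
    if abs(gamma) < 1e-4:
        return 0.5 + gamma / 12.0  # leading terms of the series near zero
    if gamma < 0.0:
        return 1.0 - mean_exp(-gamma)  # reflection x -> 1 - x
    return 1.0 / (-math.expm1(-gamma)) - 1.0 / gamma

def symmetric_gamma(a, b, beta, lo=-50.0, hi=50.0):
    """Unique root of b + a * m(gamma) - gamma / beta for a < 0.

    The left-hand side is strictly decreasing in gamma when a < 0, so plain
    bisection applies; the bracket [lo, hi] is assumed to contain the root.
    """
    def h(gamma):
        return b + a * mean_exp(gamma) - gamma / beta
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

betas = [10.0, 100.0, 1000.0, 10000.0]
gammas = [symmetric_gamma(-1.0, 0.3, beta) for beta in betas]
means = [mean_exp(gm) for gm in gammas]
```

For these assumed parameters the computed $\gamma$ changes noticeably between the two smallest values of $\beta$ and then levels off, while the mean of the distribution approaches $-b/a$ rather than an endpoint of the interval.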

So far our analysis has focused on the symmetric solutions to the steady state Equations (19). For sufficiently small $\beta$, the
symmetric solution is the only one. However, above some critical value of $\beta$, another, asymmetric solution appears (strictly
speaking, there are two solutions related by the $x \leftrightarrow y$ symmetry). This is shown in Figure 3(a), where we compare
analytical and simulation results for a particular value of $\beta$. In simulations, we had to start from asymmetric initial
strategies in order to arrive at the asymmetric steady state. The structure of the solution can be summarized as follows: with
increasing $\beta$, the strategy around 0 narrows much faster than the strategy around 1. This is also shown in Figure 3(b), where
we plot the bifurcation diagram of the solution to Equations (19). For small $\beta$, there is only a symmetric solution,
$\gamma_1 = \gamma_2 = \gamma$. Starting from a critical $\beta$, the asymmetric solution appears, with $\gamma_1$ and $\gamma_2$
diverging from each other. We note that the symmetric solution exists at any value of $\beta$, as shown by the dashed line.

FIG. 3: (a) Asymmetric steady-state strategy profiles for the bi-linear game; (b) The "bifurcation" diagram of Equation (19).

To understand the nature of the asymmetric solution, we make the following observation: Consider the discrete map

$$z_{t+1} = \beta g(z_t) \tag{22}$$

Clearly, the symmetric solutions to Equation (19) correspond to the fixed points of this map, $z^* = \beta g(z^*)$. Furthermore, it
is easy to see that asymmetric solutions of Equation (19), if they exist, correspond to the attractors of the map (22) with period
$T = 2$. And conversely, if Equation (19) allows an asymmetric solution, then the map (22) necessarily has an attractor with period
$T = 2$. A simple inspection shows that for $a < 0$, $|b| < |a|$ such attractors exist, as shown schematically in Figure 4.
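The period-2 behavior can be observed by simply iterating the map. The sketch below is illustrative and rests on assumptions: it takes $g(\gamma) = b + a\,m(\gamma)$ with $m(\gamma)$ the mean of the density proportional to $e^{\gamma x}$ on $[0, 1]$, and uses the hypothetical values $a = -1$, $b = 0.5$, $\beta = 30$ and the starting point $z = 1$.

```python
import math

def mean_exp(gamma):
    """Mean of the density proportional to exp(gamma * x) on [0, 1]."""
    if abs(gamma) < 1e-4:
        return 0.5 + gamma / 12.0  # leading terms of the series near zero
    if gamma < 0.0:
        return 1.0 - mean_exp(-gamma)  # reflection x -> 1 - x
    return 1.0 / (-math.expm1(-gamma)) - 1.0 / gamma

def step(z, a, b, beta):
    """One application of the map z -> beta * g(z)."""
    return beta * (b + a * mean_exp(z))

a, b, beta = -1.0, 0.5, 30.0  # hypothetical values with a < 0 and |b| < |a|
z = 1.0
for _ in range(200):
    z = step(z, a, b, beta)

gamma_1 = z
gamma_2 = step(gamma_1, a, b, beta)
back = step(gamma_2, a, b, beta)  # returns to gamma_1 on a period-2 orbit
```

After the transient, the iterates alternate between two values of opposite sign. By construction the pair satisfies $\gamma_2 = \beta g(\gamma_1)$ and, once the orbit has converged, $\gamma_1 = \beta g(\gamma_2)$, which is the asymmetric form of the self-consistency condition.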
FIG. 4: Illustration of an attractor with period $T = 2$. The circles represent the corresponding solutions $\gamma_1$, $\gamma_2$.

B. Quadratic Payoff 

We now consider another important class of games where the payoffs can be expressed through the following quadratic forms:

$$f_1(x, y) = -(x + y - 2a_1)^2, \qquad f_2(x, y) = -(x + y - 2a_2)^2 \tag{23}$$

Here $0 < a_1, a_2 < 1$, and generally speaking, $a_1 \neq a_2$. Let $P_1(x)$ and $P_2(y)$ be the steady state strategy profiles
corresponding to this payoff. It is easy to check that the average rewards are given as follows:

$$R_1(x) = -(x + \bar{y} - 2a_1)^2 - (\overline{y^2} - \bar{y}^2) \tag{24}$$
$$R_2(y) = -(y + \bar{x} - 2a_2)^2 - (\overline{x^2} - \bar{x}^2) \tag{25}$$

This suggests the following form for the steady state strategies:

$$P_1(x) = c(x_0) \, e^{-\beta (x - x_0)^2} \tag{26}$$
$$P_2(y) = c(y_0) \, e^{-\beta (y - y_0)^2} \tag{27}$$

where $c(x_0)$, $c(y_0)$ are the respective normalization factors, and the function $c(z)$ can be expressed through the error
functions as follows:

$$c(z) = 2 \sqrt{\frac{\beta}{\pi}} \left[ \mathrm{erf}(\sqrt{\beta} z) + \mathrm{erf}(\sqrt{\beta} (1 - z)) \right]^{-1} \tag{28}$$

Combining Equations (13)-(14) and (24)-(27), we find the following transcendental equations for the parameters $x_0$ and $y_0$:

$$x_0 = 2a_1 - \mu(y_0) \tag{29}$$
$$y_0 = 2a_2 - \mu(x_0) \tag{30}$$

where the function $\mu(z)$ is the mean of a truncated Gaussian distribution centered at $z$, and is given as follows:

$$\mu(z) = z - c(z) \, \frac{e^{-\beta (1 - z)^2} - e^{-\beta z^2}}{2\beta} \tag{31}$$



Let us first analyze the symmetric case $a_1 = a_2 = a$. Note that in this case the game becomes a pure coordination game, which
has continuously many Nash equilibria given by $x^* + y^* = 2a$. Furthermore, it is easy to check that there is no mixed Nash
equilibrium. To see why this is the case, assume the contrary, and let $P_1(x)$ be the mixed strategy of the first agent. Then,
using Equation (25), it is straightforward to show that the best response of the second agent is to play a pure strategy
$P_2(y) = \delta(y - y^*)$ with $y^* = 2a - \bar{x}$. However, if the second agent plays according to this pure strategy, then the
first agent will do better by playing a pure strategy as well, $P_1(x) = \delta(x - x^*)$ with $x^* = 2a - y^*$.

Let us now consider the corresponding steady state structure within the replicator framework. The steady state strategy
profiles are (truncated) normal distributions centered at points $(x_0, y_0)$, which need to be found from the system of
transcendental Equations (29)-(31). We now analyze this system in more detail. First of all, we check for a symmetric solution
$x_0 = y_0$, for which the system of equations reduces to the following equation:

$$x_0 = 2a - \mu(x_0) \quad \Longleftrightarrow \quad x_0 = a + c(x_0) \, \frac{e^{-\beta (1 - x_0)^2} - e^{-\beta x_0^2}}{4\beta} \tag{32}$$

A graphical representation of Equation (32) is shown in Figure 5(a). An inspection confirms that a symmetric solution is present
for any value of the inverse temperature $\beta$. The actual steady state strategy profiles corresponding to different values of
$\beta$ are shown in Figure 5(b) together with simulation results. Note that in the limit $\beta \to \infty$ one has
$x_0 = y_0 = a$, suggesting that the strategy profiles of both agents become peaked around $a$ in the limit of large $\beta$.
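The symmetric center can be computed numerically from the normalization factor and the truncated-Gaussian mean. The sketch below is an illustration under stated assumptions: it uses the forms $c(z) = 2\sqrt{\beta/\pi}\,[\mathrm{erf}(\sqrt{\beta} z) + \mathrm{erf}(\sqrt{\beta}(1 - z))]^{-1}$ and $\mu(z) = z - c(z)\,(e^{-\beta(1-z)^2} - e^{-\beta z^2})/(2\beta)$, solves $x_0 = 2a - \mu(x_0)$ by bisection, and assumes that the root lies in $[0, 1]$, which holds for the example values used here. The function names are hypothetical.

```python
import math

def c_norm(z, beta):
    """Normalization factor of exp(-beta * (x - z)^2) on [0, 1]."""
    s = math.sqrt(beta)
    return 2.0 * math.sqrt(beta / math.pi) / (
        math.erf(s * z) + math.erf(s * (1.0 - z))
    )

def mu(z, beta):
    """Mean of the Gaussian centered at z and truncated to [0, 1]."""
    diff = math.exp(-beta * (1.0 - z) ** 2) - math.exp(-beta * z ** 2)
    return z - c_norm(z, beta) * diff / (2.0 * beta)

def symmetric_center(a, beta, iters=80):
    """Bisection on x0 - 2a + mu(x0) = 0, assuming the root is in [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - 2.0 * a + mu(mid, beta) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

center_half = symmetric_center(0.5, 10.0)
center_shifted = symmetric_center(0.4, 50.0)
```

For $a = 0.5$ the symmetry of the interval gives $x_0 = 0.5$ at any $\beta$. For the assumed values $a = 0.4$, $\beta = 50$ the truncation is almost negligible and the center lies very close to $a$.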




FIG. 5: (a) Graphical representation of Equation (32); (b) Steady-state strategy profiles for the coordination game with $a = 0.5$
and for three different values of the inverse temperature $\beta$.



The picture above suggests that in the zero-temperature limit $\beta \to \infty$, the steady state strategy profiles should
correspond to the symmetric pure strategy Nash equilibrium $x^* = y^* = a$. Indeed, our results suggest that if one starts in the
vicinity of the symmetric equilibrium, then this picture holds. However, further analysis reveals that the convergence to this
symmetric equilibrium is not trivial. To understand why this is the case, let us consider again the steady state Equations (29)
and (30), and look for solutions of the form $x_0 + y_0 = 2a$. This leads to the following equation:

$$x_0 = 2a - \mu(2a - x_0) \tag{33}$$

The graphical representation of Equation (33) is shown in Figure 6. An inspection of this equation for various $\beta$ yields the
following observation: the symmetric solution $x_0 = y_0 = a$ is the only solution, and for relatively small values of $\beta$, the
replicator dynamics settles into this symmetric solution relatively quickly. However, increasing $\beta$ leads to the appearance of
continuously many metastable states that are characterized by the condition $x + y \approx 2a$. Thus, when the system starts close
to those metastable states, it can get trapped there for very long times, converging very slowly to the actual steady state. In
fact, the convergence time diverges as $\beta \to \infty$. Also, note that the metastable state that the system will fall into
depends on the initial conditions.

Finally, we consider the asymmetric case $a_1 \neq a_2$. It can be shown that any asymmetry, however small, drastically changes the
structure of the Nash equilibria. Consider, for instance, the following perturbation of the game considered above:
$a_1 = a - \delta a$, $a_2 = a + \delta a$, with $a = 0.5$. It is clear that for any $\delta a$, no matter how small, the asymmetry
leads to a single pure Nash equilibrium at $\{x^*, y^*\} = \{0, 1\}$. Our simulations show that similar behavior persists for finite
$\beta$, and that the agents' strategies drift towards the deterministic equilibrium points when one increases $\beta$, as shown in
Figure 7.
FIG. 7: Steady-state strategy profiles for the asymmetric quadratic game with parameters $a_1 = 0.45$, $a_2 = 0.55$, and for two
values of the inverse temperature, $\beta = 10, 50$.



C. The Political Advertisement Game 



In addition to simple bi-linear and quadratic games, where analytical examination of the steady state structure was possible,
we considered other games for which the solution of the steady state equation cannot be obtained analytically, so one has to
use numerical techniques. As an example, we consider the so-called political advertisement game, where two political parties
decide their expenditure levels $x$ and $y$ for campaign advertisement. The total number of votes participating in the election is
proportional to the collective expenditure, and the fraction of votes each party gets is proportional to individual expenditures.
Thus, the payoff has the following structure:

$$f_1(x, y) = \frac{x}{x + y} - x, \qquad f_2(x, y) = \frac{y}{x + y} - y \tag{34}$$

It is easy to show that this game has a single non-trivial Nash equilibrium at $x^* = y^* = 1/4$. Furthermore, a simple inspection
shows that there are no mixed Nash equilibria.
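The location of the equilibrium can be checked with a simple grid search over best responses. The sketch below is illustrative only. It assumes the payoff $f_1(x, y) = x/(x + y) - x$, which is the form consistent with the stated equilibrium $x^* = y^* = 1/4$, and it assigns zero payoff when nobody advertises; that convention and the function names are assumptions.

```python
def payoff(x, y):
    """Payoff of the party spending x against an opponent spending y."""
    if x + y == 0.0:
        return 0.0  # assumed convention when nobody advertises
    return x / (x + y) - x

def best_response(y, n=1000):
    """Grid point in [0, 1] that maximizes the payoff against a fixed y."""
    grid = [k / n for k in range(n + 1)]
    return max(grid, key=lambda x: payoff(x, y))

br_at_equilibrium = best_response(0.25)
br_off_equilibrium = best_response(0.09)
```

Setting the derivative of the assumed payoff to zero gives the best response $\sqrt{y} - y$, which equals $y$ only at $y = 1/4$. The grid search reproduces this fixed point, and for $y = 0.09$ it returns $0.21$.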

We studied the above game using both a numerical solution of the replicator dynamics equations and actual simulations
of the game. In Figure 8(a) we plot the steady state strategy profile of the agents in the advertisement game. Again, the solution
is symmetric. The strategy profile of each agent seems to be centered around the pure Nash equilibrium, with the exact shape of the
density depending on the parameter $\beta$. The average strategy of an agent, defined as $x_{\mathrm{avg}} = \int dx \, P_i(x) \, x$,
is close, but not equal to, the Nash equilibrium value $x^* = 1/4$.

FIG. 8: Steady-state strategy profiles for the political advertisement game: (a) Average strategy (expenditure) of an agent plotted
against the inverse temperature $\beta$. The horizontal line corresponds to the Nash equilibrium $x^* = 1/4$.



D. Investment Game 



Our last example is a game for which the steady state of the replicator dynamics coincides with the Nash equilibrium.
Consider a model of two-firm competition where each firm chooses an investment level from the unit interval. The firm with
the highest investment wins the market, which has unit value, with ties broken randomly. Denoting the investment levels of the
players as $x$ and $y$, the payoff structure has the following form:

$$f_1(x, y) = \begin{cases} 1 - x & \text{if } x > y; \\ -x & \text{if } x < y; \\ \tfrac{1}{2} - x & \text{if } x = y, \end{cases} \tag{35}$$

with a similar (symmetric) payoff for the second agent. Note that the game does not have a pure Nash equilibrium. Indeed,
assume the contrary, and let $x^*$ be the equilibrium strategy of the first agent. If $x^* < 1$, then the second agent will do
better by playing $y = x^* + \epsilon < 1$, while if $x^* = 1$, then he will be better off playing $y = 0$. At the same time, one
can show that there is a Nash equilibrium where both players mix uniformly over the pure strategies. It is straightforward to show
that the same strategy profiles are also the steady state solutions of the replicator equation. Indeed, note that for a strategy
profile $P(y)$ of the second agent, the average reward of the first one is given as

$$R_1(x) = -x + \int_0^x dy \, P(y). \tag{36}$$

Then a simple inspection shows that choosing $P(x) = P(y) = 1$ solves the steady state equations, since the average reward is then
constant in $x$. This is shown in Figure 9, where we also show the simulation results.
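The argument can be checked numerically. The sketch below is illustrative: it assumes the average reward $R_1(x) = -x + \int_0^x dy \, P(y)$, evaluated with a midpoint rule, and compares the uniform density with a hypothetical non-uniform density $P(y) = 2y$. The function names are assumptions.

```python
def average_reward(x, density, n=2000):
    """R_1(x) = -x + integral of the density from 0 to x (midpoint rule)."""
    h = x / n
    win_prob = sum(density((k + 0.5) * h) for k in range(n)) * h
    return -x + win_prob

def uniform(y):
    return 1.0

def linear(y):
    return 2.0 * y  # hypothetical non-uniform density for comparison

xs = [k / 10 for k in range(11)]
rewards_uniform = [average_reward(x, uniform) for x in xs]
reward_linear_mid = average_reward(0.5, linear)
```

Against a uniform opponent the average reward vanishes for every $x$, so a density proportional to $e^{\beta R_1(x)}$ is itself uniform for any $\beta$. Against the non-uniform density the reward depends on $x$, and the response would not be uniform.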

Note that there is an intuitive argument why the steady state of the replicator dynamics corresponds to the Nash equilibrium
for this particular game. Indeed, recall that the Nash equilibrium minimizes the energy, while the replicator dynamics minimizes
the free energy functional (11) which, at non-zero temperature, is different from the energy. Those two minimization objectives
will generally yield different strategy profiles. For this particular game, however, the strategy profile that minimizes the energy
also happens to maximize the entropy (and thus, minimize the entropic term in the free energy), so that minimization of either
objective yields the same result.

FIG. 9: Steady-state strategy profiles for the Investment Game.



V. CONCLUSION 



In this paper we presented a generalization of the replicator dynamics framework to study the dynamics of multi-agent
reinforcement learning with continuous strategy spaces. We presented a set of functional-differential equations that describe
the adaptive learning dynamics in multi-agent settings. We also derived a set of equations that characterize the steady state
strategy profile of the learning agents. We demonstrated the analytical framework on several examples, and obtained excellent
agreement with the simulation results.

It was shown, both theoretically and through simulations, that for the Boltzmann exploration mechanism, the long-term
limit of the replicator dynamics does not necessarily correspond to the Nash equilibria of the corresponding game-theoretical
system. The reason for this is the bounded rationality of the agents, as characterized by the exploration noise in their strategies.
Specifically, the replicator dynamics at non-zero temperature minimizes the free energy, and not the energy, whose minimization
corresponds to the Nash equilibria for rational agents. We demonstrated on several examples that the Nash equilibrium is often
recovered from the replicator dynamics in the zero-temperature limit $\beta \to \infty$. In this limit, the strategies generally
become $\delta$-measures peaked at the Nash equilibria. However, as was shown on the example of the bi-linear game, this is not
always the case. In particular, for some games the replicator equations might yield steady state solutions that have continuous
support even in the limit $\beta \to \infty$, and which do not have a corresponding Nash equilibrium in that limit. Finally, note
that for one of the games considered here (the investment game), the steady state of the replicator dynamics coincides with the
Nash equilibrium for arbitrary $\beta$. The underlying reason for this is that the strategy profile that minimizes the energy (NE)
also happens to maximize the entropy, so that the same strategy profile minimizes both the energy and the free energy. More
generally, one can say that any uniform Nash equilibrium also serves as a steady state for the replicator dynamics.

There are several important directions in which to pursue this work further. First of all, it will be worthwhile to perform a more
formal analysis of the steady state equations, and examine the issues of existence and uniqueness of the solutions depending on
the particular payoff structure. Furthermore, we intend to extend our analysis beyond two-player games, and specifically, consider
games with a very large number of agents, $N \gg 1$, where statistical-mechanical approaches for analyzing the solution structure
might be appropriate. Finally, another direction of further research is to generalize the stateless Q-learning model considered
here to a more general model that accounts for different states and probabilistic transitions between them. We believe that such a
generalization will be possible by complementing the replicator equations with equations that describe the Markovian evolution
of the states.